Fix ResourceWatcher Data Race and Redis Connection Leaks#1741

Open
chungeun-choi wants to merge 2 commits into OT-CONTAINER-KIT:main from chungeun-choi:fix/performance-optimization

Conversation

@chungeun-choi

Description

This PR significantly improves the operator's concurrent throughput and fixes internal blocking bottlenecks when orchestrating multiple RedisReplication resources.

The detailed changes include:

  1. Fix ResourceWatcher thread-safety: Replaced the value receiver with a pointer receiver (`w *ResourceWatcher`) and added a `sync.RWMutex` to protect the `watched` map against data races when `MAX_CONCURRENT_RECONCILES > 1`.
  2. Fix TCP connection leaks: Restructured `GetRedisNodesByRole` to wrap `configureRedisReplicationClient` and the `defer redisClient.Close()` call in an anonymous function, so stale connections are closed at the end of each loop iteration instead of accumulating until the function returns.
  3. Remove redundant topology calls: Refactored `redisreplication_controller.go` to merge `reconcileRedis` and `reconcileStatus` into a single `reconcileRedisAndStatus` function, yielding a ~50% reduction in TCP handshakes per reconcile step.

Fixes #ISSUE

Type of change

  • Bug fix (non-breaking change which fixes an issue)

Checklist

  • Tests have been added/modified and all tests pass.
  • Functionality/bugs have been confirmed to be unchanged or fixed.
  • I have performed a self-review of my own code.
  • Documentation has been updated or added where necessary.

Additional Context

In a resource-constrained local environment (Docker Desktop) orchestrating 30 RedisReplication clusters simultaneously:

  • Before the fix: Throughput bottlenecked at ~3.75 successfully labeled masters per minute, with connection exhaustion and repeated TCP timeouts blocking workers.
  • After the fix: Throughput rose to ~13.12 masters per minute (~3.5x improvement) under the same local resource constraints, with no goroutine leaks or panic logs.

This patch dramatically improves the throughput of the Redis operator during
large-scale provisioning contexts when MAX_CONCURRENT_RECONCILES > 1.

1. Fix controllerutil ResourceWatcher concurrent safety (Data Race fix)
2. Wrap GetRedisNodesByRole defer logic in func to prevent TCP connection leak
3. Consolidate reconcileRedis and reconcileStatus to avoid redundant topology calls

Signed-off-by: chungeun-choi <cucuridas@gmail.com>
Significantly increase the default rate limits (QPS) for kube client to
prevent aggressive client-side throttling and delays during scale-out events.
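The QPS change described above can be sketched roughly as follows. This is an illustrative fragment, not the commit's exact code or values: `QPS` and `Burst` are real `rest.Config` fields in client-go (defaulting to 5 and 10 respectively), but the numbers chosen here and the controller-runtime setup around them are assumptions.

```go
// Sketch: raise client-side rate limits so a burst of reconciles
// during scale-out isn't throttled by the kube client itself.
cfg := ctrl.GetConfigOrDie()
cfg.QPS = 100  // client-go default is 5 requests/sec
cfg.Burst = 200 // client-go default is 10
mgr, err := ctrl.NewManager(cfg, ctrl.Options{})
```

Client-side throttling shows up in operator logs as "Waited for ... due to client-side throttling" messages, which is the delay this commit targets.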

Signed-off-by: chungeun-choi <cucuridas@gmail.com>